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Some Probabilistic and Statistical Problems 
in the Analysis of DNA Sequences 

Simon Tavari 

ABSTRACT. This paper concentrates on statistical 
aspects of the estimation of substitution rates and 
divergence tines on the basis of DNA sequence data. A 
new method of estimation Is suggested* and exhibited 
using data from serum albumin and a-fetoproteln. The 
divergence time of rat and mouse is estimated using a 
tree calibrated by the human-rat divergence time. Some 
inherent difficulties in these methods are highlighted 
by statistical analysis of the sequences. 


1, INTRODUCTION 


This paper is concerned with probabilistic and statistical 
questions relating to the estimation of substitution rates and 
divergence times on the basis on DNA sequence data. Suppose that 
we have two functionally homologous genes taken one from each of 
two species. On the basis of just the observed differences in 
base composition of the two sequences* we want to estimate the 
time of divergence of the two species, and to estimate parameters 
of the evolutionary process that led to these observed 
differences. 

A general model of this process of mutation should take account 
of the relative roles of substitution* insertion and deletion* 
duplication and transposition as forces that change the structure 
of genes over time. Here I will focus only on the effects of 
substitution* the replacement of one base by another. I will also 
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assume that the two sequences under comparison are of the same 
base length n, say. 

We will label the four bases A, C, G, T by 1, 2, 3, 4 

respectively, and let (X^(t), Y^(t)) denote the bases that occur 

at the i th position (1 S i S n) in species 1 and 2 respectively, t 
time units after divergence. The assumption of divergence from a 
common ancestor means that 

X^O) - (0), i - 1, 2. n, (1.1) 

This article is divided into six sections. In Section 2, we 
review some stochastic models for the behavior of the process 
{(X(t), Y(t)), t £ 0} which describes the base composition of two 
homologous nucleotide positions in the sequences. Section 3 and 4 
discuss some statistical questions concerning estimation of 
evolutionary parameters and goodness-of-fit tests for these 
models. The methods are illustrated with respect to the serum 
albumin and ot-fetoprotein genes of man, mouse and rat. In Section 
5 we treat the problem of estimating the divergence time of two 
species on the basis of sequence data when the phylogeny length is 
calibrated by a more distant sequence of known divergence. 
Section 6 contains some concluding comments. 

II. STOCHASTIC MODELS OF SUBSTITUTIONS 

We need to model the stochastic behavior of the process {(X(t), 
Y(t)) , t £ 0} in which t denotes time of divergence from the 

common ancestor, and X(*)» Y(* ) denote respectively the nucleotide 
in homologous positions in sequence 1 and sequence 2. As in 
(1.1), we have X(0) « Y(0) and subsequently X(*) and Y(*) evolve 
independently. 

The substitution process (X(t), t 0} can be described by the 


transition functions 
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Pjj(t) - P[X(t) = j|X(0) - 1], (2.1) 

with a corresponding function for the Y(* ) process. If we define 
f^t) = P[X(t) - i, Y(t) = j|X(0) - Y( 0) ], (2.2) 

then our assumptions readily give 

4 

f u (t) “ p t Y j (t) 

where 

% - P[X(0) = l] £ P[Y(0) - t] . 

It is often assumed that 

Pjj(t) « s PjjU)- (2.5) 

Under this assumption, if we write * (f^t)), and P^ ~ 
( p u(t))- then (2.3) becomes in matrix notation: 



T 

F t - p t F o 

P , t * 0 




(2.6) 

where F Q = diag 

<V V V V 






It remains, 

of course, to 

specify P . 

The 

model 

most 

frequently used 

is the case 

in which (X(t), 

, t 

£ 

0) 

is a 

continuous-time 

time-homogeneous 

Markov chain, in 

which 

case we 


have (cf. Karlin and Taylor (1975), Ch. 4) 

00 

p . ■ e,t '- 2 «” »> »• (*•■') 
n-0 

Here Q ■* (q^) is the generator of {P^); Q satisfies 

- - ?11 2 0; Q1 » 0. 


(2.3) 

(2.4) 


q AJ i. 0 (i * J) ; 


( 2 . 8 ) 
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where 1-(1,1,..,1) T , 0*(0, 0 ,.,h, 0) T . We also wake a statlonarity 
requirement by assuming that K » ) satisfies 

*Q-0 , (2.9) 

If we also assume that (Y(t), t fc 0} has the same stochastic 
structure as X(*> (so that, In particular, (2.5) holds) then the 
marginal distributions of X(t) and Y(t) are Identical (and equal 
to x) for all t. 

From now until the end of Section 3, we will assume that x(-) 
and Y(*) are stochastically Identical. The evolutionary parameter 
of Interest is then the compound parameter K defined by 

4 

K - « 5 (2-10) 

t-1 


Under the statlonarity assumption (2.9), K Is the mean number 
of substitutions per homologous nucleotide site since divergence. 
Of course, If t is known, then the substitution rate can be 
estimated, and vice-versa. 


We will now review some 


substitution rate matrix Q. 
Jukes and Cantor (1969). 


lie 2.1 


of the specific forms for the 
The progenitor of these is due to 


In this case, substitutions occur at the points of a Poisson 
process of rate A, and when a substitution occurs It Is equally 
likely to be to any of the other three bases. Hence 



The parameter K is given by K - 2tA. 
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We would like to relax the assumption of uniform base 
composition and equally likely substitutions. 

Example 2.2 


A model which retains the assumptions of uniform base 
composition, and a Poisson substitution scheme was proposed by 
Kimura (1981) to allow for different transition and transversion 
probabilities. The rate matrix takes the form 

A C G T 



where A ^ a + /J + 7. The special case 7-/5 was studied by 
Kimura(1980)j see also Kimura (1983, Ch. 4). Other cases In which 


the A and T frequencies are equal (as are the C and G frequencies) 
are the four parameter model of Aokl et al. (1981) and the five 
parameter model of Takahata and Kimura (1981). 


Example 2.3 


A model that allows for arbitrary base frequencies and possibly 
different substitution rates was proposed by Kimura (1981), This 
six parameter process has the form 

A C G T 


the diagonal elements being determined by (2.8). Further 
properties of this model may be found In Gojoborl et al. (1962). 
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Feleenstein (1981), In a study of maximum likelihood methods 
for evolutionary trees, uses a generalization of the Jukes-Cantor 
model that also allows for arbitrary base frequencies* In 
generator form, we take a arbitrary, and set 



This model corresponds to embedding a sequence of Independent and 
Identically distributed substitutions in a Poisson process of rate 
A similar example was studied by Tajima and Nei (1982)* 

It has been noted by several authors (Neyman (1971), Kaplan and 
Langley (1979), Felsenstein (1981) among others) that the 
assumption of reversibility of the substitution process affords a 
useful simplification* Intuitively, the observation that x(*) 
(say) Is reversible means that the substitution process viewed 
from now into the future is probabilistically identical to its 
behavior from now back Into the past* Mathematically, the 
stationary Markov process X(-) is reversible if and only if there 
exists a collection of positive numbers a j summing to unity that 

satisfy the balance equations 

s i* J * 4- (2,11) 

When such exist, then a is the stationary distribution of the 

process, l,e,, (2,9) holds* Reversibility is discussed at length 
by Kelly (1979) and Keilson (1980), for example* From (2,7) and 
(2*11) it follows that a^jtt) - 1 £ I, J £ 4, tfe 0, 


and hence we have 
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f u (t) ■ Vu (2t > 


(2,12a) 


or, in matrix notation (cf* (2*6)) 


F P 
0 2t* 


(2*12b) 


The reversibility property Is shared by several of the previous 
examples; it is readily checked that the process with Q matrices 
given In Examples 1, 2 and 4 are reversible. 

This suggests that a general model incorporating the 
reversibility property should be studied: 

Example 2*6 

The generator Q of a reversible process with stationary 
probabilities a - {k t a^, a^, a^) can be expressed as a nine 

parameter matrix 


V !* 2 

*1*8^3 *2 X 4 / *3 


*lV*4 


Vs 7 ” 4 




(2.13) 


satisfying x^ t 0* IS 1 S 8, and the diagonal elements once more 
determined by (2*6)* 

III* ESTIMATION OF SUBSTITUTION RATES 


3*1 Statistical Methods 


Having described the process by which a particular homologous 
site evolves, we now model the stochastic structure of ((X^t), 

Vj(t)), tl 0; 1 » 1, ..*, n}* The simplest assumption here is 

that each pair of homologous nucleotides behaves independently and 
Identically* That Is, assume 


(X^(t), Y^(t)), 1 - 1 .... n are 1 * 1*d* random (3*1) 

vectors with common distribution that of (X(t), Y(t))* 
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It Is well Known that, particularly in coding regions, the 
structure of the base sequence is not that of independent 
identically distributed bases, (cf. Smith et al (1983)), As a 
consequence, it Is customary to analyze the substitution process 
In coding regions according to base position In the codon. Me 
therefore analyze three separate sequences, the first base 
position sequence then being <X 31 _ 2 (t), Y 21-2^ t ^ * 1-1 

where n - 3m. Another reason for studying the sequences by base 
position in the codon involves degeneracy in the genetic code. 
Many substitutions in the third position of codons are silent 
(that is, do not change the amlno-acld the codon represents). One 
■ight therefore expect heterogeneity of the substitution process 
along a coding region, In violation of the i.i.d, assumption 
(3,1). 

For convenience we will denote either the whole sequences or 
the sequences of 1 st , 2 nd or 3 rd codon positions by (X^t) ,Y^(t)), 

l m l .n. Our data now comprise the observations N * (N^j, 

1 £ i, j i 4) where 

- number of times we observe 
X £ < t) - i, Y e (t) - j, min, (3.2) 

Under the assumption (3.1), the (N^) have a joint multinomial 
distribution with parameters n, and given by (2.6). Using 
observations on (N^J we want to find an estimator K of the 
substitution parameter 


K * 2t J ** Q*. 

4-1 * * 


and an estimate of the asymptotic (as n *) variance of K. 

The estimation theory for the Felsensteln model of Example 2.4 
Is readily elucidated. Here we have 
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4 

K - 2tffH, where H - 3' (3.3) 

ltl 1 1 


Since j<t) - e ^ 0^ + (1 - e^*)Kj, it follows that if 

25-ij 

i*j 

Is the number of non-identical nucleotide sites, then D has a 

—2jj t 

binomial distribution with parameters n and p * (1 - e ^ )H. If 
we assume that x, and therefore H, is Known, then the maximum 

likelihood estimator of K is 

Kp - -H In (1 - jj), d - jj . (3.4) 


K e inherits Its asymptotic distribution from that of d. By the 

r 

'Delta method' (cf. Serfling (1980, p. 122)) me find that 

K “ AN(K, ■ l|2p(1 ~| l). (3.S) 

F n(H-p) 2 

2 

where AN(/i, a ) denotes "asymptotically normal, with mean p and 
2 

variance o ." 

In the special case H - 3/4 that obtains when x - (1/4, 1/4, 

1/4, 1/4), this model differs only notatlonally from Example 2.1. 
From (3.4) we obtain the Jukes-Cantor (1969) estimator 



(3.6) 

(3.7) 


The variance term in (3.7) Is due to Klmura and Ohta (1972); In 
practice, p Is estimated from the data by d. If x is assumed 


unknown, and hence must be estimated from the data, then the 
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virtiBct oT Kp ie do l«i|tr (Ivin by the appropriate tin in 

(3.5), The correct ton con readily bo calculated numerically. 

The animation theory of Example 2.2 la given by Klaura (1261). 
Tab!* 9-1 dticrlbet this model in more detoll. 

Tefcle 2.1 

Comblootlone of boaea 

miiHMl Tfanamoa Km _TrinmrH?n Usa. 

X T C A 0 TACO TOCA 

_ 1 _ CTM _ HOC _ 

frequency P 0 R 

With title notation, the estimator of X la 

t ^ - -1/4 In J(1-2P-2Q)(1-2P-2R){1-2Q 2RIJ, (3.6) 

and the ■aynptotJt variance la given, by Klaura'a equation [12], 
Kaplan and Blake (1962) proposed on internet lag alternative 
approach to the estimation of substitution rotoa. Suppose that 
the 0 matrix la of the fora Q - MR - !)♦ share ■** * 

stochastic matrix Pith 1 4 1 3 *■. The substitution 

parameter la K - lit, and their estimator of K la 

- 2(1 " J\ * 10(1 - djj, d * J (3.9) 

and 

Sw ***** B(l-d)<l+la(l-d) K 

Thle estimator ms derived by approx!mating the fora of P f+ and 
it itunld apply to cases la mtllch d la Close to ti* See also 
Kaplan {1963)* Eat loot Ion theory for Example 2.4 la dlacuaaed In 
detail by Gojoborl et al. (19821* 
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a.z at aanlMi ssiil 

At final can# m couldtr ham la for a general rtvenlble 
aodtl of funpH S-Br Af« «m tHtntlilly tnt dirftrmt 
approach#* to thin, depending on what la ateuaed about w. It tn 

dtflnt 


than uader the aodel of (2.12) sod (2.13) th# Joint distribution 
of IMjj 1 la aultinonUl. with parameters s, and lay. where 


i| wf 1 P ll < 2 t:l * J - i f 

f,j - ta.Uj 

» lPlJ t4t>. li t <jia « 

If * ic uiiwh) unknown, and hence baa to he estimated froa Che 

data, the Statistical probln reduce* Co the Satinet Ion of the 
nine pa master* (la^ tig. .... ta fl . w . w f . of th# Q-iatrin 

In (2.is) froa the multinomial data f H jjK with cell probabilities 

It can be ahown that the aonfanw likelihood eatlaatnra of 

thtif pinieiert wmy Qftvn bt feund by »Qlvln* th* 

■ _1 i * r 0 «piQ> ts.ii) 

where ■ (N ( j * and 0 la given by a matrix of the fora 

(2.13). Computationally, All la straightforward becauae such 

m 

q- mstrite* are diagonal!sable* eo that nip(q) any be computed 

easily. cf. Kellion ( 1070 ). p„ M-M, for example. The Joint 
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asymptotic distribution of the estimators then follows fro* 
standard theory. 

While (3.12) provides a simple method for estimating the 
parameters, it is not always true that (3.12) has a solution 
satisfying the restrictions of (2.13) (for example, if some 
M ij " O' 8ee Table 3.2). In these cases, estimation of the 

parameters is more complicated. 

When it is assumed known, as in the models of Felsensteln (1981) 

and Klmura (1981), then we can approach the problem in a different 
way. Our basic data remain the multlnomially distributed 
observations {M ±J >. and the cell probabilities (t^) determined 

now by just six (compound) parameters (tx. tx_). We can 

1 0 

estimate them by using, for example, minimum chi-squared 
estimation or least squares estimation techniques. 

In either case, we arrive at estimates of the elements of Q 
which have a joint asymptotic distribution that is multivariate 
normal, the parameters of which can be eetlmated from the data* 
Hence we can also estimate the substitution parameter K* Some 
examples of the method are given in the next section, and detailed 
discussion of these and related methods appears in Tavare and 
Janzen (1966)* 

3,3 Some data 

In this section we will Illustrate the results of these methods 
with two data sets. The genes are a-fetoproteln (Human [Morlnaga 
et al* (1983)], Rat [Jagodzinskl at *1,(1981)] and Mouse [Law et 
al. (1981)] and Serum Albumin (Human [Dugaiczyk et al. (1982), 
Lawn et al, (1981)], Rat [Sargent et al* (1981)], and Mouse)* 
Estimates of K using different estimators are given In Tables 3,2 
and 3.3. 

The results in Tables 3*2 and 3*3 are qualitatively similar to 
those found by other authors. Because of the chemical structure 


of DMA and the degeneracy In the genetic code, one would expect 
that in coding regions the second base should have the lowest rate 
of acceptable substitutions, and the third base the highest rate* 
All the estimators give similar results either when the divergence 
time or when the substitution rate is small* The most noticeable 
differences occur for estimates of high rates, where the models 
with fewer parameters give lower values of K* 

Table 3*2 


Estimates of K (and standard deviation) for Serum albumin** 
Estimator _ Base position in codon (n ■ 808) _ 


K 


1 


2 


3 

JC 

(9.6), (3.7) 

*1782 

(.0136) 

*1387 

(.0162) 

.6866 

(.0483) 

F f 

(3.4), (3,8) 

.1786 

(.018.7) 

.1392 

( .0163) 

,6673 

(.0484) 

K 

(3.8) 

.1760 

(.0186) 

*1369 

( ,0163) 

.7230 

(.0642) 

KR 

(3.9), (3.10) 

,1778 

(.0192) 

.1403 ( .0166) 

.6967 

(.0649) 

Revertlblw 

(3.12) 

*1794 

(.0196) 

*1418 

( .0169) 

.7274 

(.0669) 


* Base length 1824 bases* Estimates based on Rat»Man 
data. 

t standard deviation assuming x is unknown la same as 
that given to 4 d*p* 

••Estimation of parameters not possible by method of 
(3.12), since no GT or TG sites were found in data* 
Figures given here correspond to sequence with one GT- 
slts added to the sequence. 




JC 


.2298 

(3.6), 

(3.7) 


F 


.2303 

(3.4). 

(3.3) 


K 


.2324 

(3.8) 



KR 


.2342 

(3.9), 

(3.10) 


Reversible 

.2343 


( 3 . 12 ) 


(,0224) .1614 ( 
(,0220) .1921 ( 
(.0229) .1936 ( 
(.0232) .1943 ( 
(.0234) .1967 ( 


0200) .4840 (.0377) 
0201) .4846 (.0378) 
0203) .3173 (.0432) 
0206) .3048 (.0411) 
0212) .3203 (.0438) 


*Baee length 1738 bases. fiitlHUi bated on House-Nan 
data. 

tStandard deviation assuming it It unknown is sane at 
that given to 4 d.p. 

IV. A STATISTICAL LOOK AT THE SUBSTITUTION PROCESS 

The previous sections of this article have described in some 
detail a clast of Harkovlan stochastic models that have been used 
to estimate substitution rates from sequence data, In this 
section, I want to look briefly at some statistical problems 
associated with the selection of classes of processes that 
adequately describe (in a statistical sense) the observations. 

The data ueed In studies of the type described here involve 
observations taken at a single time point, t. On the basis of 
such data, we want to assess something of the nature of the 
substitution process over time. One particular question is 
whether the aubetltutlon process has proceeded at the sane rate In 
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both species, 
possibilities, 


Example 4.1 


The following examples Illustrate the 


We suppose that the substitution process Is described by (2.3), 

v v 

where - (Pjj(t)) and P yt ■ (Pjj(t)) are the transition 

matrices of Markov processes of the fora (2.7). We will look at a 
variation of the Feleenstein model of Example 2.4, In which the 
generators Q x and Q y are given by 



T 

where Q1 - 0, and * 1*1. From (2.3), we have 


P t " P Xt P 0 P Yt (P 0 “ diag( *l.V> ' 

V Q v t 

- Fq e e (by reversibility) , 

(Qjj+QyJt 

" F Q e (since Q x , Q y commute) r 

0V" y >Qt 

" *o e 

For this model, the mean number K of substitutions per site Is 

K - </* x + A( y )Ht. K - 5 V 1 " V' 

An estimator of K Is the Felsensteln estimator K described 

r 

by(3.4); note that it Is based solely on the number of sites 
showing non-identical bases. The parametere and /J y are 

confounded in the definition of K, and Identical estimates of K 
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can arise from models with equal substitution rates or with widely 
different rates. 

Example 4.2 

A simple modification of the Jukes-Cantor process described in 

Example 2.1 Is to make the substitutions In one gene occur at the 

points of a non-homogeneous Poisson process with Intensity 

function A(u), u >_Q, while those in the other gene occur at the 

points of a Poisson process of rate A. The mean and variance of 

the number of substitutions per homologous nucleotide site Is then 

K - At + J* A(u)du* If A(u)du - At, (for example. If A(u) ■ 
o o 

A/2 (0 <_ u <^t/2); 3A/2 (t/2 < u <_t)) then the data will be 

statistically Indistinguishable from those produced by the 
standard Jukes-Cantor process. 

These two elementary examples suggest that care should be taken 
in making inferences about the substitution process on the basis 
of data taken at a sing le time point. However, some assumptions 
of the models of Sections 2 and 3 can be checked by a 
non-parametric approach. 

To describe these methods, we return to the basic description 
of (2.3)* Dropping the t's for notatlonal convenience, we have 

m l** p ?i p Ij• (41) 

Under (4.1), the marginal distribution of X Is 

f i+ - P[X - 1] - : 1 * 1 s 4, (4.2) 

4 

while that of Y is 

f +J * P[Y - j] - 4. (4.3) 

If. as In (2.8), p*j - p][j ■ P tJ for all 1 and J then P - (f^) la 
symmetric, and the marginals of X and Y will be Identical. Notice 


that we only require equality of and P^ (for all 1, j) for 
the single (special) tine point t( recall Example 4.2. 

4.1 Contingency table Methods 

Under the aasunptlon (3.1) of Independent and Identically 
distributed nucleotide sites, the observation matrix N“(N..) 

^ J 

defined by (3.2) has the fora of a contingency table, with 
underlying cell probabilities P - (f AJ ). Some questions of 

Interest to our Modelling problem may now be re-expressed as 
hypothesis tests about the structure of (two-way) contingency 
tables. Perhaps the most useful test of this type Involves the 
test for symmetry of F. Several such tests have been proposed, 
but the simplest one for our purposes la that devised by Bowker 
(1946). Under the null hypothesis that F Is symmetric with f^+ 

f,. > 0, he established that the statistic 


11 ^ J1 


Kj 


’ij "ji 


(4.4) 


Is asymptotically n distributed as \ 2 with 6 degrees of 

freedom, In Table 4.1, we give observed values of the 
statistic for the data used In Table 3.2 and 3,3. 
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Table 4.1 V ii “ "i* + M + 1 ~ 2 N ii' V ij “ -<"ij + V' 1 + 


2 

Observed x values for 

test of 

symmetry (4.4)** 

If d - (dj, dj* d^), then the statistic 

Sequences 

Base position 


S 2 - d T V _i d (d.5) 


1 

2 

3 

* w 

Albumin 

20.17 

5.49* 1 * 

55.33 

has asymptotically a distribution with 3 degrees of freedom 

(Rat-Man) 




under the null hypothesis of marginal homogeneity. Table 4,2 

a-fetoprotein 

(Nouse-Man) 

4,35 

1.72 

45,82 

gives the observed $ 2 values for our data sets. 





Table 4.2 

* 2 

5% significance point of x A - 

12.09 


2 ■ ■ i 

Observed $ values for test of marginal homogeneity (4.6),* 

2 

1% significance point of x A ■ 

16.81 


Sequences Base position 

5 degrees of freedom 




-—_l_ i _3_ 


The results of this screening suggest that for the third 
position data, the Markovian models described In Section 3 are not 
appropriate. 

Of coursei we may have marginal homogeneity (that is, f - 

f + j. V 1) without symmetry. Such behavior Is exhibited, for 

example by Markovian models in which the initial distribution of X 
and Y Is the stationary distribution of both and Once 

more, several methods to Judge the hypothesis of marginal 
homogeneity have been proposed. Maximum likelihood and minimum 
discrimination information approaches use iterative methods (cf. 
Ireland et al, (1969), Madansky (1963). See also the approach of 
Grizzle et al> (1969)). However, a simple test statistic has been 
described by Stuart (1955). Define 


Albumin 

13.19 8.19 

04.86 

(Rat-Nan) 



a-fetoproteln 

2.31 1.03 

45.04 

_(Mouse-Man1 




2 

5* significance point of x~ - 7,82 

3 

1* significance point of * 11.34 


Note that the third position data exhibits high marginal in- 
homogenelty, suggesting once more that the Markov models analyzed 
in Sections 2 and 3 are not appropriate. 

Note that a process in which F has neither the marginal 
homogeneity nor the symmetry property could still be generated by 
a time-homogeneous Markovian scheme, but the generators and 


V N +J ^ N ij and V N i + - N + i- lml > 2 - 3 - 

j 1 


Let be a 3 x 3 matrix with elements 


should be different, and x cannot then be the stationary 

distribution for both X and Y (for then, marginal homogeneity 
would obtain). It is worth noting that in this case, the quantity 
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K In (2.10) that we have tried to estimate Is no longer the 
mean number of substitutions per site In time t, but should be 
Interpreted In an asymptotic sense. 

Finally, even If the assumption of marginal homogeneity Is 
reasonablei we can further teat for the form of the resultant 
marginal distribution. For example, one property of the Klmura 
model of Example 2.2 Is that the marginal distribution Is x * 

(1/4, 1/4, 1/4, 1/4) ♦ We may test such an assumption within our 
contingency table framework by testing F for given marginals (In 
this case, both being x given above). Methods for testing for 

given marginals are described by Ireland and Kullback (1968), for 
example. These methods are, once more, Iterative In spirit; 
rather than record the details, we give In Table 4.3 the values of 
the goodness-of-fit statistic for the data of first and second 
codon positions. The third position data are omitted, since 
marginal homogeneity Is ruled out by the results of Table 4.2. 

The results in Table 4.3 show that the data are incompatible 
with x - (1/4, 1/4, 1/4, 1/4). From the point of view of 

estimating K within the Markovian framework, this might not seem 
to matter; in both first and second base positions, the data In 
Tables 3.2 and 3.3 have similar estimates of K for many underlying 
models. However, one question of interest involves estimation of 
transversion and transition rates. These estimates are based on a 
more detailed examination of estimates of the elements of Q, and 
such estimates are particularly sensitive to departures from the 
underlying form of k. 


Table 4,3 


Observed value of 

test of given marginals 

it - (1/4, 1/4, 1/4, 

* 

1/4) In both species. 

Sequences 

Base position 

1 2 

Albumin 

53.84 

75.06 

(Rat-Man) 



a-fetoprotein 

24.60 

81.05 

(Mouse-Man) 



* 2 

5* signficance point of Xg 

• 12.59 

2 

1* signficance point of x 

- 16.SI 


4.2 Independence 

The contingency table analyses presented here depend a good 
deal on the assumption that homologous sites behave Independently 
and Identically. This assumption then allows us to use standard 
asymptotic results for contingency tables. Several authors have 
studied the effects of serial dependence on the asymptotic 
behavior of such 'standard' contingency table test statistics. 
The results of these studies suggest that departures from 
independence can cause serious distortions In the ’usual' x 2 tests 
(cf. Tavar€ and Altham (1983), Tavar$ (1984), Gleaer and Moore 
(1983, 1984)). We therefore analyzed the base composition of each 
gene In each species of the basic data set described In Section 

3.3 using standard Markov chain methods; cf. Chatfleld (1973). 

The results Indicated, perhaps surprisingly, that each of the 

three codon position sequences Is not Inconsistent with 
Independence. While marginal Independence of this type is not 
sufficient to establish the stronger Independence required In 
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(3.1), the results indicate that for the sequences examined here, 
(3*1) may not be unreasonable* 

V* ESTIMATING DIVERGENCE TIME FROM A CALIBRATED TREE 

Suppose now that we have homologous sequences from three 
species, and that the species have the known phytogeny displayed 
in Figure 5*I * 


Figure 6,1 



In Figure 5*1, X^ denotes the nucleotide appearing at a particular 

homologous site In species l t 1-1, 2, 3* Me will assume that 
the phytogeny Is calibrated (perhaps from the fossil record, as In 
Jacobs and Pllbeam (1980)), In that t <■ t^ + t g Is assumed known. 

The problem is to estimate the divergence time t^ of species 2 and 

3, and also to estimate the variance of this estimate* 

For simplicity, we will assume that the substitution process 
leading to the observations (Xj, X^, X^) has the same stochastic 

structure In each arm of the tree, and that each process is 
Markovian with transition matrix P # - (p^j(s)) t as described by 

one of the models In Section 2. If we define 
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f ijk " p(x i “ h - i' h - k >- 

then by conditioning on the ancestral nucleotide at positions A 
and B In Figure 5.1, we have 

f ljk " 1 *t p ri (t l + V 5 p rz <V p zj <V p zk t 5 ' 1 * 

r z 

where Is the probability of base r at node A, If we assuae 

once aore that w Is the stationary distribution for P t , and Is 

reversible, then Feleenstein'e (1981) "Pulley Principle" reduces 
(5.1) to 

f ljk “ } *z p zi (t 2 + 2t l> p Z j ( V p zk<V • (5 - 2 > 

z 

To proceed further, we assume a model like Felsensteln's given 
In Example 2*4. The simple structure allows us to compute (5*2) 
easily* Under the assumption that each homologous site behaves 
independently and Identically, the random variables N^ given by 

N ijk “ nuniber of times we observe x i " 1 * x 2 " ^ 1 X 3 " k 
In the n homologous sites 

have a Joint multinomial distribution with parameters n and f 

ijk 

1 * i, J, ki 4, Once more using + T to denote summation over 
that Index, define 


N 12 “ 2 N ll* “ *< X 1 “ X 2> - 

1 

N 13 “ 5 M i + i " *< X 1 " V * 


S 23 " 1 N + ii “ *< X 2 " V 
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Hence we may use as an estimator of tg the quantity 

t 2 - t In *7g/ln q 1 (8.4) 

where ^ - q[l/2(<» 12 + <J ig ) - <1 - H)], q % - jj[d 23 - (1 - H)], and 

K « ^ - ir^) is assumed known. 

The asymptotic variance of can be estimated from the 

asymptotic joint normality of (d^, d J3 , d 23 > using the 

multivariate delta method yet again; numerical values are readily 
evaluated on a computer. 

To give a flavor of the results, we present In Tables 5.1 and 
5.2 the estimates of the divergence time of rat (X^) and mouse 

(X.) based on a tree calibrated by the known divergence time t of 

o 

man (X^) and rat, using the data for a-fetoprotein and serum 

albumin described In Section 3,3, 

The discussion of Section 4 suggests that the assumptions made 
In arriving at the estimates In Table 5,1 and 5,2 may not be 
appropriate for the 1st and 3rd codon position data; recall the 
inherent asymmetry Involved In 3rd positions. The second poaltlon 
leads to estimates of 14,5 (serum albumin) and 33.1 
(a-fetoproteln) million years (MY) for the divergence time of rat 
and mouse. A common estimate, baaed on both genes, is then about 
23.8 + 4.87 MY, This figure la somewhat larger than the 8-14 MY 

figure suggested by Jacobs and Pllbeam on the basis of fossil 
evidence, 
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There are Many directions In which analyses of this sort can be 
extended. Most obviously, we could use other Markovian models of 
the type described in Sections 2 and 3. The general reversible 
Models seen particularly tractable; a crude estinate of t^ for the 

above data is 14*4 (serum albumin) and 32.6 (a-fetoproteln) MY, 
with an average of 23.6 MY; this differs little from the simpler 
Pelsensteln model's results. Kaplan and Rlsko (1982) extend their 
method for two species (cf. 3.9) and 3.10)) to the case of m 
species with known phylogeny; their approach could easily be 
modified to attack the present problem, too. 

Prom a statistical point of view, the methods described In 
Section 4 will apply equally well In this setting; the analyses of 
contingency tables suggested there carry over to three- and higher 
dimensional tables also. The stochastic methods here can also be 
extended to cover the case of 4 or more species with known 
phylogeny. 

VI. CONCLUSIONS 

This paper has given a rather bald account of the mathematical 
and statistical aspects of one problem in the theory of molecular 
evolution. Without a doubt, the mathematical models studied here 
are grossly simplified. Nevertheless, the vast amounta of data 
available on DNA sequences suggest that useful models can be 
developed. The statistical approaches outlined here should be 
useful in finding parsimonious descriptions of the data. 

I have not touched on some related aspects of the central 
problem. In particular, there are several studies focussing on 
the estimation of transition and transversion probabilities; cf. 
Pitch (1980) and Holmqulst (1983). Estimation In the Markov 
models discussed here provides another statistical approach that 
may prove useful. 


One area which we are studying involves the fitting of more 
general models that allow for the observed asymmetry In the data 
of third codon positions. Such models will also allow us to 
assess the stability of estimates of divergence times based on 
sequence data; Tavar£ and Janzen (1986). 

The difficult and challenging problema of statistical 
estimation of the phylogeny Itself have not been described here. 
Pelsensteln (1963) provides an excellent overview of this area. 
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